Homework 2: Adversarial Bandits
Abstract
Notation. We use notation mostly from the class. $A$ is the set of arms, $T$ is the time horizon, and $K$ is the number of arms. In each round $t$, $a_t$ is the arm chosen by the algorithm and $c_t(a)$ is the cost of arm $a$. The total cost of arm $a$ is $\mathrm{cost}(a) = \sum_{t=1}^{T} c_t(a)$. The best total cost for a subset $S$ of arms is defined as $\mathrm{cost}^*(S) = \inf_{a \in S} \mathrm{cost}(a)$. The total cost of an algorithm ALG is $\mathrm{cost}(\mathrm{ALG}) = \sum_{t=1}^{T} c_t(a_t)$. Regret is defined as $R(T) = \mathrm{cost}(\mathrm{ALG}) - \mathrm{cost}^*(A)$. In all problems, assume that the costs are chosen by an oblivious adversary.
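The definitions above can be made concrete with a small sketch. It assumes the oblivious adversary's costs are available as a fixed table indexed by round and arm; the function names (`total_cost`, `best_cost`, `algorithm_cost`, `regret`) and the sample data are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Sequence

# costs[t][a] plays the role of c_t(a); the table is fixed in advance,
# which is what an oblivious adversary amounts to (an assumption of this sketch).
CostTable = Sequence[Sequence[float]]


def total_cost(costs: CostTable, arm: int) -> float:
    """cost(a): sum of c_t(a) over all rounds t."""
    return sum(round_costs[arm] for round_costs in costs)


def best_cost(costs: CostTable, arms: Iterable[int]) -> float:
    """cost*(S): smallest total cost among the arms in S."""
    return min(total_cost(costs, a) for a in arms)


def algorithm_cost(costs: CostTable, chosen: Sequence[int]) -> float:
    """cost(ALG): sum of c_t(a_t), where chosen[t] plays the role of a_t."""
    return sum(costs[t][a] for t, a in enumerate(chosen))


def regret(costs: CostTable, chosen: Sequence[int]) -> float:
    """R(T) = cost(ALG) - cost*(A), with A = {0, ..., K-1}."""
    num_arms = len(costs[0])
    return algorithm_cost(costs, chosen) - best_cost(costs, range(num_arms))


# Illustrative data: T = 3 rounds, K = 2 arms.
costs = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
chosen = [1, 0, 1]  # this sequence pays cost 1.0 in every round
```

With this data, arm 0 has total cost 1.0 and arm 1 has total cost 2.0, so `regret(costs, chosen)` evaluates to 3.0 - 1.0 = 2.0.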
Similar resources
An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits
We present an algorithm that achieves almost optimal pseudo-regret bounds against adversarial and stochastic bandits. Against adversarial bandits the pseudo-regret is $O(K\sqrt{n \log n})$ and against stochastic bandits the pseudo-regret is $O(\sum_i (\log n)/\Delta_i)$. We also show that no algorithm with $O(\log n)$ pseudo-regret against stochastic bandits can achieve $\tilde{O}(\sqrt{n})$ expected regret against adaptive...
An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits
We present a new strategy for gap estimation in randomized algorithms for multiarmed bandits and combine it with the EXP3++ algorithm of Seldin and Slivkins (2014). In the stochastic regime the strategy reduces dependence of regret on a time horizon from $(\ln t)^3$ to $(\ln t)^2$ and eliminates an additive factor of order $\Delta e^{1/\Delta^2}$, where $\Delta$ is the minimal gap of a problem instance. In the adversarial regime...
The Best of Both Worlds: Stochastic and Adversarial Bandits
We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the $O(\sqrt{n})$ worst-case regret of Exp3 (Auer et al., 2002b) and the (poly)logarithmic regret of UCB1 (Auer et al., 2002a) for stochastic rewards. Adversarial rewards and stochastic rewards are the two...
Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments
EXP3 is a popular algorithm for adversarial multiarmed bandits, suggested and analyzed in this setting by Auer et al. [2002b]. Recently there has been increased interest in the performance of this algorithm in the stochastic setting, due to its new applications to stochastic multiarmed bandits with side information [Seldin et al., 2011] and to multiarmed bandits in the mixed stochastic-adversaria...
One Practical Algorithm for Both Stochastic and Adversarial Bandits
We present an algorithm for multiarmed bandits that achieves almost optimal performance in both stochastic and adversarial regimes without prior knowledge about the nature of the environment. Our algorithm is based on augmentation of the EXP3 algorithm with a new control lever in the form of exploration parameters that are tailored individually for each arm. The algorithm simultaneously applies...